Several recent studies of amplicon taxonomic assignment methods have suggested that training Naive Bayes taxonomic classifiers against only the region of a sequence that was amplified, rather than a full length sequence, will give better taxonomic assignment results (Mizrahi-Man et al. 2013, Werner et al. 2012).
This recipe shows how an alignment can be trimmed to cover only a region bound by a pair of primers, using scikit-bio's functionality for sequencing matching. The list of skbio.DNA
objects that results from this process could be used to train a Naive Bayes classifier.
First we'll load an alignment into a skbio.TablularMSA
object (MSA stands for multiple sequence alignment).
We're going to work with the qiime-default-reference so we have easy access to some sequences. If you want to adapt this recipe to create a region specific reference collection for another set of sequences, just assign the reference_alignment_filepath
variable to a new filepath.
import qiime_default_reference
import skbio
reference_alignment_filepath = qiime_default_reference.get_template_alignment()
reference_alignment = skbio.TabularMSA.read(reference_alignment_filepath, constructor=skbio.DNA)
reference_alignment
TabularMSA[DNA] ----------------------------------------------------------------------- Stats: sequence count: 4797 position count: 7682 ----------------------------------------------------------------------- --------------------------------- ... --------------------------------- --------------------------------- ... --------------------------------- ... --------------------------------- ... --------------------------------- --------------------------------- ... ---------------------------------
Next, we'll define the forward and reverse primers as DNA
objects. The primers that we're using here are pulled from Supplementary File 1 of Caporaso et al. 2012. Note that we're reverse complementing the reverse primer when we load it here. scikit-bio does not automatically try to match the reverse or reverse complement of a sequence, as some programs like BLAST and uclust do (for the sake of computational efficency).
fwd_primer = skbio.DNA("GTGCCAGCMGCCGCGGTAA", {'id':'fwd-primer'})
rev_primer = skbio.DNA("GGACTACHVGGGTWTCTAAT", {'id':'rev-primer'}).reverse_complement()
These primers contain some degeneracies, or characters representing multiple bases. In practice, this means that each of these primer sequences actually represents a collection of sequences, once the degeneracies are expanded. We can see what non-degenerate sequences are represented by our degenerate forward primer sequence as follows:
for nondegenerate_sequence in fwd_primer.expand_degenerates():
print(nondegenerate_sequence)
GTGCCAGCCGCCGCGGTAA GTGCCAGCAGCCGCGGTAA
We're going to match any of the non-degenerate variants of each primer in this example.
The typical way to approach the problem of finding the boundaries of a short sequence in a longer sequence would be to use pairwise alignment. But, we're going to try a different approach here since pairwise alignment is inherently slow (it scales quadratically). Because these are sequencing primers, they're designed to be unique (so there shouldn't be multiple matches of a primer to a sequence), and they're designed to match as many sequences as possible. So let's try using regular expressions to match our sequencing primers in the reference database. Regular expression matching scales linearly, so is much faster to apply to many sequences. The scikit-bio GrammaredSequence
objects contain a method, to_regex
, that allows for conversion of a degenerate (or nondegenerate) sequence into a regular expression.
fwd_primer.to_regex()
re.compile(r'GTGCCAGC[CA]GCCGCGGTAA', re.UNICODE)
rev_primer.to_regex()
re.compile(r'ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC', re.UNICODE)
We can use these to create a new regular expression that will match the sequence, excluding the forward and reverse primers (since those are nearly the same in all sequences, they won't be useful in training a taxonomic classifier, so we choose not to include them).
regex = '{0}(.*){1}'.format(fwd_primer.to_regex().pattern,
rev_primer.to_regex().pattern)
print(regex)
GTGCCAGC[CA]GCCGCGGTAA(.*)ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC
Next, let's apply this regular expression to all of our reference sequences. This will let us find out how many reference sequences our pattern matches, and the start and stop positions of each match.
import pandas as pd
starts = []
stops = []
seq_count = 0
match_count = 0
for seq in reference_alignment:
seq_count += 1
for match in seq.find_with_regex(regex, ignore=seq.gaps()):
match_count += 1
starts.append(match.start)
stops.append(match.stop)
starts = pd.Series(starts)
stops = pd.Series(stops)
match_percentage = (match_count / seq_count) * 100
print('{0} of {1} ({2:.2f}%) sequences have exact matches to the regular expression.'.format(match_count, seq_count, match_percentage))
3033 of 4797 (63.23%) sequences have exact matches to the regular expression.
Since we're matching our aligned reference sequences against an alignment, finding matches in around 60% of our sequences gives us an idea of how to slice all of our sequences, since the purpose of a multiple sequence alignment is to normalize the position numbers across all of the sequences in a sequence collection. One problem with matching against an alignment though is that the gaps in the alignment could make it harder to match our regular expression as the gaps would disrupt our matches. We get around this using the ignore
parameter to DNA.find_with_regex
, which takes a boolean vector (a fancy name for an array or list of boolean values) indicating positions that should be ignored in the regular expression match.
We can next look at the distribution of start positions and stop positions for each match. As you can see, these nearly always match the same position, and therefore tell us where we want to slice our aligned sequences to extract the region between our primers.
starts.describe()
count 3033.000000 mean 2263.022090 std 1.916592 min 2220.000000 25% 2263.000000 50% 2263.000000 75% 2263.000000 max 2358.000000 dtype: float64
stops.describe()
count 3033.000000 mean 4051.132542 std 7.281395 min 4050.000000 25% 4051.000000 50% 4051.000000 75% 4051.000000 max 4452.000000 dtype: float64
The positions that we want to slice our alignment at can be found as follows:
fwd_primer_end = int(starts.mode())
rev_primer_start = int(stops.mode())
Next, we're ready to filter out all positions outside of this range. We do this by specifying that range using TabularMSA.iloc
. This gives us a TabularMSA
that has the same number of sequences that we started with, but fewer positions in each Sequence
.
filtered_reference_alignment = reference_alignment.iloc[..., fwd_primer_end:rev_primer_start]
filtered_reference_alignment
TabularMSA[DNA] ----------------------------------------------------------------------- Stats: sequence count: 4797 position count: 1788 ----------------------------------------------------------------------- T--AC---AG-AG-GGT-GCA-A-G-CG-TTAA ... ----------G-TGGG-GAG-C-A-AACA--GG T--AC---GA-AG-GGG-GCT-A-G-CG-TTGT ... ---------GTGGGGG--AG-C-G-AACA--GG ... C--AC---CG-GC-AGC-TTA-A-G-TG-GTGA ... ----------C-AGGG-GAG-C-G-AACG--GG T--AC---CA-GC-TCC-GCG-A-G-TG-GTTG ... ----------T-GGGG-GAG-C-A-AAGG--GG
Because Naive Bayes classifiers are generally trained on unaligned sequences, we'd also want to remove all gaps from the alignment to get the unaligned sequences. These are the sequences that we'd want to use to train our classifier.
filtered_reference_sequences = [e.degap() for e in filtered_reference_alignment]
filtered_reference_sequences[0]
DNA --------------------------------------------------------------------- Metadata: 'description': '' 'id': '1111561' Stats: length: 253 has gaps: False has degenerates: False has definites: True GC-content: 52.57% --------------------------------------------------------------------- 0 TACAGAGGGT GCAAGCGTTA ATCGGATTGA CTGGGCGTAA AGGGCGCGTA GGCGGTAAGA 60 TAAGTCAGAT GTTAAAAACC CGAGCTCAAC TTGGGGACTG CATTTGAAAC TATCTCACTA 120 GAGTACAGTA GAGGAGAGCG GAATTTCCGG TGTAGCGGTG AAATGCGTAG ATATCGGAAG 180 GAACACCAGT GGCGAAGGCG GCTCTCTGGA CTGACACTGA CGCTGAGGCG CGAAAGCGTG 240 GGGAGCAAAC AGG
filtered_reference_sequences[-1]
DNA --------------------------------------------------------------------- Metadata: 'description': '' 'id': '4483258' Stats: length: 253 has gaps: False has degenerates: False has definites: True GC-content: 66.01% --------------------------------------------------------------------- 0 TACCAGCTCC GCGAGTGGTT GGGGCGTTTA CTGGGCCTAA AGCGCCCGTA GCCGGCCCGG 60 TAAGTCACCC CTGAAATCCA TGGGCTCAAC CCATGGGCCG GGGGTGATAC TGCCGGGCTT 120 GGGGGTGGGA GAGGCCGAGG GTACTCCCGA GGTAGGGGCG AAATCCGATA ATCTCGGGAG 180 GACCACCAGT GGCGGAAGCG CTCGGCTGGA ACACGCCCGA CGGTGAGGGG CGAAAGCTGG 240 GGGAGCAAAG GGG